The gym chain Model Fitness is developing a customer interaction strategy based on analytical data. One of the most common problems gyms and other services face is customer churn. How do you know whether a customer is no longer with you? You can calculate churn from people who cancel their accounts or don't renew their contracts. However, it isn't always obvious that a client has left: they may walk out on tiptoe. Churn indicators vary from field to field. If a user buys from an online store rarely but regularly, you can't call them a runaway. But if for two weeks they haven't opened a channel that's updated daily, that's a reason to worry: your follower may have gotten bored and left.
For a gym, it makes sense to say a customer has left if they don't come for a month. Of course, it's possible they're in Cancun and will resume their visits when they return, but that's not a typical case. Usually, if a customer joins, comes a few times, and then disappears, they're unlikely to come back.
In order to fight churn, Model Fitness has digitized a number of its customer profiles. Your task is to analyze them and come up with a customer retention strategy.
We will use the data from gym_churn_us.csv.
import math
import warnings

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from plotly import graph_objects as go
from plotly.subplots import make_subplots
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, silhouette_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings('ignore')
data = pd.read_csv('/datasets/gym_churn_us.csv')
data.sample()
data.columns = map(str.lower, data.columns)
data.head()
data.info()
data.describe()
print('Duplicates amount in the dataframe:', data.duplicated().sum())
data.isna().sum()
There are neither missing nor duplicated values in the dataset (isnull() is just an alias of isna(), so one check is enough).
data1 = data.groupby('churn').mean().reset_index()
data1
subplot_titles = ('gender', 'near_location', 'partner', 'promo_friends', 'phone',
                  'contract_period', 'group_visits', 'age',
                  'avg_additional_charges_total', 'month_to_end_contract',
                  'lifetime', 'avg_class_frequency_total',
                  'avg_class_frequency_current_month')

fig = make_subplots(rows=7, cols=2,
                    subplot_titles=subplot_titles, vertical_spacing=0.04)

idx = 0
legend = True
# plot each feature's distribution for the churned and retained groups
for col in data.columns.values[0:13]:
    r = idx // 2 + 1
    c = idx % 2 + 1
    # churn is an integer flag, so compare with 1/0, not the strings "1"/"0"
    fig.add_trace(go.Histogram(x=data.query('churn == 1')[col],
                               name='churn=yes', legendgroup='churn',
                               marker={'color': 'DodgerBlue'},
                               showlegend=legend),
                  row=r, col=c)
    fig.add_trace(go.Histogram(x=data.query('churn == 0')[col],
                               name='churn=no', legendgroup='nochurn',
                               marker={'color': 'DarkTurquoise'},
                               showlegend=legend),
                  row=r, col=c)
    idx += 1
    legend = False

fig.update_xaxes(type="category", row=3, col=2)
fig.update_layout(height=1700)
fig.show()
We can see that the retained group (churn = 0) has higher means for most features, such as contract period, group visits, and partner status. These customers also spend more on additional services, renew their subscriptions, and visit more often. Age and gender have less influence on churn. In addition, age is approximately normally distributed, and avg_class_frequency_current_month stays very close to a normal distribution.
corr_data = data.corr()
plt.figure(figsize = (15,15))
plt.title('Correlation Matrix of Features', fontsize=18)
sns.heatmap(corr_data, square=True, annot=True)
plt.show()
The heatmap shows strong correlations, most notably between contract_period and month_to_end_contract, and between avg_class_frequency_total and avg_class_frequency_current_month, since each of these pairs measures closely related quantities.
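Such pairs can also be pulled out of a correlation matrix programmatically instead of reading them off the heatmap. A minimal sketch on toy data; the `strong_pairs` helper and the 0.7 threshold are illustrative assumptions, not part of the project code:

```python
import numpy as np
import pandas as pd

def strong_pairs(corr: pd.DataFrame, threshold: float = 0.7) -> list:
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(corr.iloc[i, j]) >= threshold:
                pairs.append((cols[i], cols[j], round(corr.iloc[i, j], 2)))
    return pairs

# toy data: b is a noisy copy of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({'a': a,
                   'b': a + rng.normal(scale=0.1, size=200),
                   'c': rng.normal(size=200)})
print(strong_pairs(df.corr()))  # only the (a, b) pair clears the threshold
```

On the real data the same call would take `corr_data` instead of `df.corr()`.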
Build a binary classification model for customers where the target feature is whether the user leaves next month (churn). Divide the data into train and validation sets using the train_test_split() function, then train models with two methods:
# the target is churn; everything else is a feature
X = data.drop('churn', axis=1)
y = data['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
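Since churned customers are the minority class, it can also be worth passing stratify=y to train_test_split() so that both sets keep the same churn share. A small sketch on synthetic data (the `X_demo`/`y_demo` names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# synthetic imbalanced binary target standing in for churn
X_demo, y_demo = make_classification(n_samples=1000, weights=[0.75, 0.25],
                                     random_state=0)
# stratify keeps the positive share of the test set equal to the overall share
_, _, _, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                 random_state=0, stratify=y_demo)
print(round(y_te.mean(), 2), round(y_demo.mean(), 2))
```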
# define the algorithm for the logistic regression model
lr_model = LogisticRegression()
# train the model
lr_model.fit(X_train, y_train)
# use the trained model to make predictions
lr_predictions = lr_model.predict(X_test)
lr_probabilities = lr_model.predict_proba(X_test)[:,1]
# define the algorithm for the new random forest model
rf_model = RandomForestClassifier(n_estimators = 100)
# train the random forest model
rf_model.fit(X_train, y_train)
# use the trained model to make predictions
rf_predictions = rf_model.predict(X_test)
rf_probabilities = rf_model.predict_proba(X_test)[:,1]
print('Metrics for logistic regression:')
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, lr_predictions)))
print('Precision: {:.2f}'.format(precision_score(y_test, lr_predictions,average='weighted')))
print('Recall: {:.2f}'.format(recall_score(y_test, lr_predictions,average='weighted')))
print('F1: {:.2f}'.format(f1_score(y_test, lr_predictions,average='weighted')))
print('Metrics for random forest:')
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, rf_predictions)))
print('Precision: {:.2f}'.format(precision_score(y_test, rf_predictions,average='weighted')))
print('Recall: {:.2f}'.format(recall_score(y_test, rf_predictions,average='weighted')))
print('F1: {:.2f}'.format(f1_score(y_test, rf_predictions,average='weighted')))
The accuracy, recall, and F1 metrics are higher for the logistic regression model.
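A single train/test split can be noisy, so the ranking is more trustworthy when checked with cross-validation. A sketch on synthetic data (the real notebook would pass its own X and y instead of the assumed `X_demo`/`y_demo`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the gym data: a binary target, 10 features
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     random_state=0)

for name, model in [('logistic regression', LogisticRegression(max_iter=1000)),
                    ('random forest', RandomForestClassifier(n_estimators=100,
                                                             random_state=0))]:
    # 5-fold cross-validated accuracy: mean and spread across folds
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    print(f'{name}: mean accuracy {scores.mean():.2f} (+/- {scores.std():.2f})')
```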
Set aside the column with data on churn and identify object (user) clusters:
# standardize the features and build the linkage matrix for hierarchical clustering
sc = StandardScaler()
x_sc = sc.fit_transform(X)
linked = linkage(x_sc, method='ward')
plt.figure(figsize=(15, 10))
dendrogram(linked, orientation='top')
plt.title('Hierarchical clustering for GYM')
plt.show()
Looking at the resulting dendrogram, we can estimate that about five clusters can be singled out.
# standardize the data
sc = StandardScaler()
x_sc = sc.fit_transform(X)
# define the K-means model with 5 clusters (fixed random_state for reproducibility)
km = KMeans(n_clusters=5, random_state=0)
# predict the clusters for observations (the algorithm assigns each a number from 0 to 4)
labels = km.fit_predict(x_sc)
# store the cluster labels in a copy of our dataset
data_for_hists = data.copy()
data_for_hists['cluster_km'] = labels
# print the mean feature values per cluster
data_for_hists.groupby(['cluster_km']).mean()
We can see that cluster 4 has the highest contract period and also the highest month_to_end_contract, which means its contracts were renewed recently; it also has a high share of partner-company customers. In addition, cluster 4 has a fairly high lifetime: these are repeat customers.
# categorical / discrete feature plots
groupmode = ['gender', 'near_location', 'partner', 'promo_friends', 'phone',
             'contract_period', 'group_visits', 'month_to_end_contract']

fig = make_subplots(rows=4, cols=2,
                    subplot_titles=groupmode, vertical_spacing=0.07)

idx = 0
legend = True
for col in groupmode:
    r = idx // 2 + 1
    c = idx % 2 + 1
    # one histogram per cluster, grouped side by side
    for cluster, color in [(3, 'MediumSlateBlue'), (0, 'DodgerBlue'),
                           (4, 'DarkTurquoise'), (2, 'PaleGreen'),
                           (1, 'LemonChiffon')]:
        fig.add_trace(go.Histogram(
            x=data_for_hists.query('cluster_km == @cluster')[col],
            name=f'cluster{cluster}', legendgroup=f'cluster{cluster}',
            marker={'color': color},
            showlegend=legend),
            row=r, col=c)
    idx += 1
    legend = False

fig.update_xaxes(type="category", row=3, col=2)
fig.update_layout(barmode='group', height=1200)
fig.show()
# continuous feature plots
overlaymode = ['age', 'avg_additional_charges_total', 'lifetime',
               'avg_class_frequency_total', 'avg_class_frequency_current_month']

fig = make_subplots(rows=3, cols=2,
                    subplot_titles=overlaymode, vertical_spacing=0.07)

idx = 0
legend = True
for col in overlaymode:
    r = idx // 2 + 1
    c = idx % 2 + 1
    # one overlaid histogram per cluster
    for cluster, color in [(3, 'MediumSlateBlue'), (0, 'DodgerBlue'),
                           (4, 'DarkTurquoise'), (2, 'PaleGreen'),
                           (1, 'LemonChiffon')]:
        fig.add_trace(go.Histogram(
            x=data_for_hists.query('cluster_km == @cluster')[col],
            name=f'cluster{cluster}', legendgroup=f'cluster{cluster}',
            marker={'color': color},
            showlegend=legend),
            row=r, col=c)
    idx += 1
    legend = False

fig.update_layout(barmode='overlay', height=900)
fig.update_traces(opacity=0.75)
fig.show()
We can see that cluster 3 customers are the oldest and visit the most, while cluster 0 customers are the youngest, visit much less, and have the shortest lifetime.
table = data_for_hists.groupby('cluster_km')['churn'].agg(count='count',sum='sum')
table['churn_rate'] = (table['sum']/table['count']*100).round(2)
table
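Because churn is a 0/1 flag, the count/sum arithmetic above is equivalent to a single mean() per group. A toy sketch (the `demo` frame is illustrative):

```python
import pandas as pd

# toy frame: with a binary flag, the group mean *is* the churn rate
demo = pd.DataFrame({'cluster_km': [0, 0, 0, 1, 1],
                     'churn':      [1, 1, 0, 0, 0]})
rate = (demo.groupby('cluster_km')['churn'].mean() * 100).round(2)
print(rate.to_dict())  # {0: 66.67, 1: 0.0}
```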
churn_rate_data = data_for_hists.merge(table, on='cluster_km')
churn_rate_data.groupby(['cluster_km']).mean()
Cluster 0 has the highest churn rate at 51.52%, followed by cluster 2 at 44.44% and cluster 1 at 26.68%. Clusters 3 and 4 have the lowest churn rates, 7.24% and 2.76% respectively. So clusters 3 and 4 are the most loyal, while clusters 0, 2, and 1 are prone to leaving.
There are strong correlations, most notably between contract_period and month_to_end_contract, and between avg_class_frequency_total and avg_class_frequency_current_month.
The accuracy, recall, and F1 metrics are higher for the logistic regression model.
There are five clusters. Cluster 4 has the highest contract period and the highest month_to_end_contract, meaning its contracts were renewed recently, and it has a high share of partner-company customers; its fairly high lifetime marks these as repeat customers. Cluster 3 customers are the oldest and visit the most, while cluster 0 customers are the youngest, visit much less, and have the shortest lifetime. Cluster 0 has the highest churn rate at 51.52%, followed by cluster 2 at 44.44% and cluster 1 at 26.68%; clusters 3 and 4 have the lowest churn rates, 7.24% and 2.76% respectively. So clusters 3 and 4 are the most loyal, while clusters 0, 2, and 1 are prone to leaving.